Sadaf Sadeghian 810195419

MNIST

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
In [164]:
trainData = pd.read_csv('./dataset/MNIST/train_data.csv', header = None)
trainLabels = pd.read_csv('./dataset/MNIST/train_label.csv', header = None, names = ['label'])
testData = pd.read_csv('./dataset/MNIST/test_data.csv', header = None)
testLabels = pd.read_csv('./dataset/MNIST/test_label.csv', header = None, names = ['label'])

Visualization

In [165]:
def visualize(digit):
    image = np.array(digit, dtype='float')
    pixels = image.reshape((28, 28))
    plt.imshow(pixels, cmap='gray')
    plt.show()
In [166]:
digitNine = trainData.loc[trainLabels['label']==9]
nineDigit = digitNine.iloc[0]
visualize(nineDigit)

K-Nearest Neighbors

1.
K-Nearest Neighbors is a classification algorithm in machine learning that belongs to the supervised learning domain.
KNN is based on the assumption that similar data points lie close to each other. Given an unclassified point, we can assign it to a group by observing which group its nearest neighbors belong to; therefore a distance function appropriate to the problem should be used to measure the distance between points in feature space and find the nearest neighbors.
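The neighbor vote can be sketched in a few lines of NumPy (a toy illustration with made-up 2-D points, not the scikit-learn classifier used below):

```python
import numpy as np

def knn_predict(x, train_X, train_y, k=3):
    # Euclidean distance from x to every training sample
    dists = np.linalg.norm(train_X - x, axis=1)
    # indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # majority vote among their labels
    values, counts = np.unique(train_y[nearest], return_counts=True)
    return values[np.argmax(counts)]

# toy 2-D data: two well-separated clusters
train_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
train_y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([0.5, 0.5]), train_X, train_y))  # → 0
```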

2.

In [5]:
from sklearn.neighbors import KNeighborsClassifier
In [6]:
for i in range(1, 7):
    neighTest = KNeighborsClassifier(n_neighbors = i)
    neighTest.fit(trainData.values, trainLabels.values.ravel() )
    print("k =",i, " score =",  neighTest.score(testData.values, testLabels.values.ravel()))
k = 1  score = 0.86
k = 2  score = 0.87
k = 3  score = 0.895
k = 4  score = 0.9
k = 5  score = 0.89
k = 6  score = 0.885

Using the score function we can observe that the score increases as k grows from 1 to 4, and decreases as k increases further.
So we will use k = 4.

In [7]:
k= 4
neighClf = KNeighborsClassifier(n_neighbors=k)
neighClf.fit(trainData.values, trainLabels.values.ravel())
Out[7]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=4, p=2,
                     weights='uniform')
In [8]:
testNum = 200
pred = neighClf.predict(testData.iloc[0:testNum].values)

correctPred = 0
for index, predicted in enumerate(pred):
    if predicted == testLabels.values[index]:
        correctPred += 1
print("Accuracy: ", correctPred / testNum)   

for index, digit in enumerate(testData.iloc[0:5].values):
    visualize(digit)
    print("Predicted digit:", pred[index])
Accuracy:  0.9
Predicted digit: 2
Predicted digit: 7
Predicted digit: 5
Predicted digit: 2
Predicted digit: 3

3.

In [9]:
n1 = 150
n2 = 500
testAcc = []
trainAcc = []
kRange = range(1, 7)

for i in kRange:
    neighTest = KNeighborsClassifier(n_neighbors = i)
    neighTest.fit(trainData.values, trainLabels.values.ravel() )
    
    trainAcc.append(neighTest.score(trainData.iloc[0:n2].values, trainLabels.iloc[0:n2].values.ravel()))
    testAcc.append(neighTest.score(testData.iloc[0:n1].values, testLabels.iloc[0:n1].values.ravel()))

plt.plot(kRange, trainAcc)
plt.plot(kRange, testAcc)
plt.legend(['Accuracy on train data', 'Accuracy on test data'], loc='lower right')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.title('Effect of n_neighbors on accuracy of k-nearest neighbors algorithm')
plt.show()

Accuracy on test data is maximum when k = 4

4.
Accuracy on the train data is naturally maximal at k = 1, since each training example is its own nearest neighbor and therefore gets its own label. As k increases, train accuracy decreases (and loss increases), because each example is now matched against additional neighbors that are not identical to it.

5.

In [10]:
testExampleIndex = 2

print("Test example:")
visualize(testData.iloc[testExampleIndex].values)
thisNeigh = neighClf.kneighbors([testData.iloc[testExampleIndex]])

print("Neighbors: ")
for n in thisNeigh[1][0]:
    visualize(trainData.iloc[n].values)
Test example:
Neighbors: 

6.

  • Computationally expensive: we need to compute the distance from each query example to all training samples. Because KNN is a lazy learning method, most of the computation happens at test time rather than training time, which can be an issue for large datasets and makes the algorithm slow.
  • Optimal number of neighbors: choosing the optimal number of neighbors to consider when classifying a new data entry is an issue in itself.
  • High memory requirement: all (or almost all) of the training samples need to be stored in memory.

Decision Tree

7.
Decision trees are a non-parametric supervised learning method used for classification.
A decision tree is a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of that test, and each leaf node holds a class label.
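Such a flowchart can be written out by hand for a toy example; the feature names and thresholds below are purely illustrative, not taken from the MNIST tree fitted below:

```python
# A hand-written "flowchart" mimicking a tiny decision tree:
def tiny_tree_predict(petal_length, petal_width):
    if petal_length <= 2.45:        # internal node: test on an attribute
        return "setosa"             # leaf node: class label
    elif petal_width <= 1.75:       # another internal test
        return "versicolor"
    else:
        return "virginica"

print(tiny_tree_predict(1.4, 0.2))  # → setosa
```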

8.

In [11]:
from sklearn.tree import DecisionTreeClassifier
In [12]:
treeClf = DecisionTreeClassifier(max_depth = 13, random_state = 0, min_samples_leaf = 1, max_leaf_nodes = 170
                                 , min_samples_split  = 4, max_features = 200)
treeClf.fit(trainData.values, trainLabels.values.ravel())
Out[12]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=13,
                       max_features=200, max_leaf_nodes=170,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=4,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')
In [13]:
pred = treeClf.predict(testData.iloc[0:200].values)

correctPred = 0
for index, predicted in enumerate(pred):
    if predicted == testLabels.values[index]:
        correctPred += 1
print("Accuracy:", correctPred / testNum, "\n\n") 

print("Some of test examples")
print("_"*50)
for index, digit in enumerate(testData.iloc[0:5].values):
    print("Test example:")
    visualize(digit)
    print("Predicted digit:", pred[index])
    print("_"*50)
Accuracy: 0.73 


Some of test examples
__________________________________________________
Test example:
Predicted digit: 5
__________________________________________________
Test example:
Predicted digit: 7
__________________________________________________
Test example:
Predicted digit: 8
__________________________________________________
Test example:
Predicted digit: 2
__________________________________________________
Test example:
Predicted digit: 5
__________________________________________________

9.

In [14]:
testAccTree = []
trainAccTree = []
maxDepthRange = range(6, 20)

for i in maxDepthRange:
    treeTest = DecisionTreeClassifier(max_depth = i, random_state = 0, min_samples_leaf = 1, max_leaf_nodes = 170
                                      , min_samples_split  = 4, max_features = 200)
    treeTest.fit(trainData.values, trainLabels.values.ravel())
    
    trainAccTree.append( treeTest.score(trainData.values, trainLabels.values.ravel()))
    testAccTree.append( treeTest.score(testData.values, testLabels.values.ravel()))
    
plt.plot(maxDepthRange, trainAccTree)
plt.plot(maxDepthRange, testAccTree)
plt.legend(['Accuracy on train data', 'Accuracy on test data'], loc='lower right')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.title('Effect of max_depth on decision tree algorithm')
plt.show()

The optimum accuracy on test data occurs when max_depth = 12

10.

In [39]:
from sklearn.tree import export_graphviz
import pydotplus
from IPython.display import Image 

dotData = export_graphviz(treeClf, out_file=None, filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dotData)
Image(graph.create_png())
Out[39]:

11.
Overfitting occurs in a decision tree when its decisions are no longer general and are instead tailored specifically to the training examples.

Increasing max_depth can therefore lead to overfitting. As shown below, the model's accuracy on the training data approaches 1, but drops to about 0.7 on the test data, which indicates overfitting.

In [16]:
testAccTree = []
trainAccTree = []
maxDepthRange = range(6, 20)

for i in maxDepthRange:
    treeTest = DecisionTreeClassifier(max_depth = i, random_state = 0, min_samples_leaf = 1, max_leaf_nodes = 170
                                      , min_samples_split  = 4, max_features = 200)
    treeTest.fit(trainData.values, trainLabels.values.ravel())
    
    trainAccTree.append( treeTest.score(trainData.values, trainLabels.values.ravel()))
    testAccTree.append( treeTest.score(testData.values, testLabels.values.ravel()))
    
plt.plot(maxDepthRange, trainAccTree)
plt.plot(maxDepthRange, testAccTree)
plt.legend(['Accuracy on train data', 'Accuracy on test data'], loc='lower right')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.title('Effect of max_depth on decision tree algorithm')
plt.show()
In [17]:
testAccTree = []
trainAccTree = []
kRange = range(50, 350, 20)

for i in kRange:
    treeTest = DecisionTreeClassifier(max_depth = 12, random_state = 0, min_samples_leaf = 1, max_leaf_nodes = i
                                      , min_samples_split  = 4, max_features = 200)
    treeTest.fit(trainData.values, trainLabels.values.ravel())
    
    trainAccTree.append( treeTest.score(trainData.values, trainLabels.values.ravel()))
    testAccTree.append( treeTest.score(testData.values, testLabels.values.ravel()))
    
plt.plot(kRange, trainAccTree)
plt.plot(kRange, testAccTree)
plt.legend(['Accuracy on train data', 'Accuracy on test data'], loc='lower right')
plt.xlabel('max_leaf_nodes')
plt.ylabel('accuracy')
plt.title('Effect of max_leaf_nodes on decision tree algorithm')
plt.show()

Another parameter that can cause overfitting is max_leaf_nodes, which limits the number of leaves in the decision tree. By increasing max_leaf_nodes we can end up with almost every training example in its own leaf, so the model stops being general and overfits.

Random Forest

12.
Random Forest is a supervised learning algorithm that can be used for both classification and regression problems.
A random forest builds multiple decision trees and merges their outputs to obtain a more accurate and stable prediction. To classify a new object from an input vector, we run the input vector down each tree in the forest. Each tree produces a classification, and we say the tree "votes" for that class; the forest chooses the classification with the most votes.
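The voting step itself is just a majority count, which can be sketched as follows (the per-tree predictions here are hypothetical):

```python
from collections import Counter

# Suppose three fitted trees each classify the same input (hypothetical predictions):
tree_votes = [3, 8, 3]
winner, count = Counter(tree_votes).most_common(1)[0]
print("forest predicts", winner, "with", count, "of", len(tree_votes), "votes")
```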

13.

In [18]:
from sklearn.ensemble import RandomForestClassifier
In [19]:
randomClf = RandomForestClassifier(n_estimators=120, max_depth=11, random_state=0, max_features = 8
                                   , min_samples_split = 3)
randomClf.fit(trainData.values, trainLabels.values.ravel())
Out[19]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=11, max_features=8, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=3,
                       min_weight_fraction_leaf=0.0, n_estimators=120,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
In [20]:
n1 = 200
n2 = 500

print("Accuracy on TrainData : ", randomClf.score(trainData.iloc[0:n2].values, trainLabels.iloc[0:n2].values.ravel()))
print("Accuracy on TestData :",randomClf.score(testData.iloc[0:n1].values, testLabels.iloc[0:n1].values.ravel()))
Accuracy on TrainData :  1.0
Accuracy on TestData : 0.93

14.

In [21]:
n1 = 200
n2 = 500
testAcc = []
trainAcc = []
kRange = range(1, 14)

for i in kRange:
    randomTest = RandomForestClassifier(n_estimators=120, max_depth=i, random_state=0, max_features = 8
                                        , min_samples_split = 3)
    randomTest.fit(trainData.values, trainLabels.values.ravel())
    
    trainAcc.append(randomTest.score(trainData.iloc[0:n2].values, trainLabels.iloc[0:n2].values.ravel()))
    testAcc.append(randomTest.score(testData.iloc[0:n1].values, testLabels.iloc[0:n1].values.ravel()))

plt.plot(kRange, trainAcc)
plt.plot(kRange, testAcc)
plt.legend(['Accuracy on train data', 'Accuracy on test data'], loc='lower right')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.title('Effect of max_depth on random forest algorithm')
plt.show()

Accuracy on test data is maximum when max_depth = 11

15.

n_estimators:
The number of trees in the random forest

In [22]:
n1 = 200
n2 = 500
testAcc = []
trainAcc = []
kRange = range(20, 200, 10)

for i in kRange:
    randomTest = RandomForestClassifier(n_estimators=i, max_depth=11, random_state=0, max_features = 8
                                        , min_samples_split = 3)
    randomTest.fit(trainData.values, trainLabels.values.ravel())
    
    trainAcc.append(randomTest.score(trainData.iloc[0:n2].values, trainLabels.iloc[0:n2].values.ravel()))
    testAcc.append(randomTest.score(testData.iloc[0:n1].values, testLabels.iloc[0:n1].values.ravel()))

plt.plot(kRange, trainAcc)
plt.plot(kRange, testAcc)
plt.legend(['Accuracy on train data', 'Accuracy on test data'], loc='lower right')
plt.xlabel('n_estimators')
plt.ylabel('accuracy')
plt.title('Effect of n_estimators on random forest algorithm')
plt.show()

Accuracy on test data is maximum when n_estimators = 120

min_samples_split:
The minimum number of data points a node must contain before it can be split; in other words, the minimum number of samples required to split a node.

In [23]:
n1 = 200
n2 = 500
testAcc = []
trainAcc = []
kRange = range(2,10)

for i in kRange:
    randomTest = RandomForestClassifier(n_estimators=120, max_depth=11, random_state=0, max_features = 8
                                        , min_samples_split = i)
    randomTest.fit(trainData.values, trainLabels.values.ravel())
    
    trainAcc.append(randomTest.score(trainData.iloc[0:n2].values, trainLabels.iloc[0:n2].values.ravel()))
    testAcc.append(randomTest.score(testData.iloc[0:n1].values, testLabels.iloc[0:n1].values.ravel()))

plt.plot(kRange, trainAcc)
plt.plot(kRange, testAcc)
plt.legend(['Accuracy on train data', 'Accuracy on test data'], loc='center right')
plt.xlabel('min_samples_split')
plt.ylabel('accuracy')
plt.title('Effect of min_samples_split on random forest algorithm')
plt.show()

Accuracy on test data is maximum when min_samples_split = 3

max_features:
Number of features to consider when looking for the best split

In [24]:
n1 = 200
n2 = 500
testAcc = []
trainAcc = []
kRange = range(2,20)

for i in kRange:
    randomTest = RandomForestClassifier(n_estimators=120, max_depth=11, random_state=0, max_features = i
                                        , min_samples_split = 3)
    randomTest.fit(trainData.values, trainLabels.values.ravel())
    
    trainAcc.append(randomTest.score(trainData.iloc[0:n2].values, trainLabels.iloc[0:n2].values.ravel()))
    testAcc.append(randomTest.score(testData.iloc[0:n1].values, testLabels.iloc[0:n1].values.ravel()))

plt.plot(kRange, trainAcc)
plt.plot(kRange, testAcc)
plt.legend(['Accuracy on train data', 'Accuracy on test data'], loc='center right')
plt.xlabel('max_features')
plt.ylabel('accuracy')
plt.title('Effect of max_features on random forest algorithm')
plt.show()

Accuracy is maximum on test data when max_features = 8

16.

In [25]:
n1 = 200
n2 = 500

randomClf = RandomForestClassifier(n_estimators=1, max_depth=7, random_state=0, max_features = 8
                                   , min_samples_split = 3)
randomClf.fit(trainData.values, trainLabels.values.ravel())
print("Single tree random forest")
print("   Accuracy on TrainData : ", randomClf.score(trainData.iloc[0:n2].values, trainLabels.iloc[0:n2].values.ravel()))
print("   Accuracy on TestData :",randomClf.score(testData.iloc[0:n1].values, testLabels.iloc[0:n1].values.ravel()))
Single tree random forest
   Accuracy on TrainData :  0.784
   Accuracy on TestData : 0.525
In [26]:
treeClf = DecisionTreeClassifier(max_depth = 16, random_state = 0)
treeClf.fit(trainData.values, trainLabels.values.ravel())
print("Decision tree")
print("   Accuracy on TrainData : ", treeClf.score(trainData.iloc[0:n2].values, trainLabels.iloc[0:n2].values.ravel()))
print("   Accuracy on TestData :",treeClf.score(testData.iloc[0:n1].values, testLabels.iloc[0:n1].values.ravel()))
Decision tree
   Accuracy on TrainData :  1.0
   Accuracy on TestData : 0.725

As we can see above, the accuracy of a single-tree random forest and of a decision tree are not the same; the decision tree is more accurate. They differ in the number of features considered when looking for the best split at each tree node: the decision tree considers all features, while each random forest tree considers a random subset of them (of size equal to the parameter max_features).

Logistic Regression

17.
Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous numeric values, logistic regression transforms its output with the logistic sigmoid function to return a probability, which can then be mapped to two or more discrete classes.

18.

In [27]:
from sklearn.linear_model import LogisticRegression


logisticRegClf = LogisticRegression(solver='lbfgs', max_iter = 450, multi_class = 'auto')
logisticRegClf.fit(trainData.values, trainLabels.values.ravel())
Out[27]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=450,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [28]:
n1 = 200
n2 = 500

print("Accuracy on TrainData : ", logisticRegClf.score(trainData.iloc[0:n2].values, trainLabels.iloc[0:n2].values.ravel()))
print("Accuracy on TestData :",logisticRegClf.score(testData.iloc[0:n1].values, testLabels.iloc[0:n1].values.ravel()))
Accuracy on TrainData :  1.0
Accuracy on TestData : 0.85

K-Means

19.
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data. The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
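One assign-and-update iteration of the algorithm can be sketched in NumPy (toy 2-D data, not the MNIST pixels used below):

```python
import numpy as np

def kmeans_step(X, centers):
    # assignment step: each point joins its nearest center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # update step: each center moves to the mean of its assigned points
    new_centers = np.array([X[assign == k].mean(axis=0)
                            for k in range(len(centers))])
    return assign, new_centers

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
centers = np.array([[0., 0.5], [10., 10.5]])
assign, centers = kmeans_step(X, centers)
print(assign)  # → [0 0 1 1]
```

Repeating this step until the assignments stop changing is the whole algorithm.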

20.

In [100]:
from sklearn.cluster import KMeans

kmeansClf = KMeans(n_clusters=10, random_state=0)
clusters = kmeansClf.fit(trainData.values)
In [53]:
centers = kmeansClf.cluster_centers_
for c in centers:
    visualize(c)

21.

In [89]:
labels = []

neigh = KNeighborsClassifier(n_neighbors = 1)
neigh.fit(trainData.values, trainLabels.values.ravel() )

for c in centers:
    visualize(c)
    label = neigh.predict([c])[0]
    labels.append(label)
    print("label : ", label)
    
label :  1
label :  3
label :  0
label :  3
label :  4
label :  3
label :  9
label :  6
label :  9
label :  1
label :  7
label :  9
label :  7
label :  6
label :  3
label :  6
label :  8
label :  2
label :  0
label :  5
label :  5
label :  3
label :  4
label :  0
label :  0
label :  7
label :  0
label :  0
label :  9
label :  6
label :  9
label :  6
label :  3
label :  1
label :  3
label :  8
label :  6
label :  0
label :  1
label :  2
label :  5
label :  6
label :  9
label :  2
label :  5
label :  4
label :  7
label :  9
label :  5
label :  6
label :  8
label :  7
label :  2
label :  7
label :  9
label :  8
label :  9
label :  5
label :  2
label :  3
label :  7
label :  4
label :  1
label :  1
label :  2
label :  6
label :  2
label :  2
label :  4
label :  4
label :  5
label :  7
label :  5
label :  2
label :  0
label :  7
label :  3
label :  9
label :  4
label :  3
label :  0
label :  6
label :  8
label :  6
label :  1
label :  8
label :  2
label :  8
label :  0
label :  5
In [55]:
def calcAccuracy(predictedLabels, labels):
    correctPred = 0
    for index, predicted in enumerate(predictedLabels):
        if predicted == labels.values[index]:
            correctPred += 1
    return (correctPred / labels.size)
In [56]:
pred = kmeansClf.predict(testData.values)

def toLabel(n): 
    return labels[n] 
   
predictedLabels = map(toLabel, pred) 
print("Accuracy: ", calcAccuracy(predictedLabels, testLabels)) 
Accuracy:  0.555

22.

In [88]:
from sklearn.cluster import KMeans

labelsTest = []
def toLabelTest(n): 
    return labelsTest[n] 

testAcckmeans = []
trainAcckmeans = []

kRange = range(10, 100, 10)
for i in kRange:
    kmeansTest = KMeans(n_clusters=i, random_state=0)
    clusters = kmeansTest.fit(trainData.values)
    
    labelsTest = []
    centers = kmeansTest.cluster_centers_
    for c in centers:
        label = neigh.predict([c])[0]
        labelsTest.append(label)
   
    pred = kmeansTest.predict(testData.values)
    trainPred = kmeansTest.predict(trainData.iloc[0:500].values)
    
    predictedLabels = map(toLabelTest, pred) 
    predictedLabelsTrain = map(toLabelTest, trainPred)
    
    testAcckmeans.append(calcAccuracy(predictedLabels, testLabels))
    trainAcckmeans.append(calcAccuracy(predictedLabelsTrain, trainLabels.iloc[0:500]))

plt.plot(kRange, trainAcckmeans)
plt.plot(kRange, testAcckmeans)
plt.legend(['Accuracy on train data', 'Accuracy on test data'], loc='lower right')
plt.xlabel('n_clusters')
plt.ylabel('accuracy')
plt.title('Effect of n_clusters on kmeans')
plt.show()

Increasing n_clusters boosts accuracy, because more clusters means more cluster centers, and thus smaller, more homogeneous clusters. The extreme case is when each data point forms its own cluster (k equal to the number of data points n), at which point accuracy on the train data reaches 1.

23.

k-means has a better time complexity for answering queries: we find the cluster centers once, and each new example is then compared only against those centers. KNN, on the other hand, compares each new example with all entries in the training data to find its k nearest neighbors, which takes more time than k-means.

Summary

24.
Random Forest
Accuracy on test data : 0.93
Hyper parameters:

  • n_estimators = 120
  • max_depth = 11
  • random_state = 0
  • max_features = 8
  • min_samples_split = 3

25.

In [ ]:
from sklearn.metrics import confusion_matrix

randomForestClf = RandomForestClassifier(n_estimators=120, max_depth=11, random_state=0, max_features = 8
                                   , min_samples_split = 3)
randomForestClf.fit(trainData.values, trainLabels.values.ravel())
pred = randomForestClf.predict(testData)

cm = confusion_matrix(testLabels, pred)

print('Confusion matrix')
print(cm)
In [ ]:
def plot_confusion_matrix(cm):
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    ax.figure.colorbar(im, ax=ax)
    ax.set(xticks=np.arange(cm.shape[1]), yticks=np.arange(cm.shape[0]), title='Confusion matrix',
           ylabel='True label', xlabel='Predicted label')

    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], 'd'), ha="center", va="center", 
                    color="white" if cm[i, j] > thresh else "black")
    plt.show()
In [ ]:
plot_confusion_matrix(cm)

26.

In [50]:
exampleIndex = 190

example = testData.iloc[exampleIndex:exampleIndex + 1].values
exampleLabel = testLabels.iloc[exampleIndex].values
visualize(example)

neighPred = neighClf.predict(example)[0]
if(neighPred == exampleLabel):
    print('KNN : Predicted correctly.')
else:
    print('KNN : Predicted incorrectly. Predicted ', neighPred)

treePred = treeClf.predict(example)[0]
if(treePred == exampleLabel):
    print('Decision Tree : Predicted correctly.')
else:
    print('Decision Tree : Predicted incorrectly. Predicted ', neighPred)

randomPred = randomClf.predict(example)[0]
if(randomPred == exampleLabel):
    print('Random Forest : Predicted correctly.')
else:
    print('Random Forest : Predicted incorrectly. Predicted ', randomPred)

logisticRegPred = logisticRegClf.predict(example)[0]
if(logisticRegPred == exampleLabel):
    print('Logistic Regression : Predicted correctly.')
else:
    print('Logistic Regression : Predicted incorrectly. Predicted ', logisticRegPred)

kmeansPred = labels[kmeansClf.predict(example)[0]]
if(kmeansPred == exampleLabel):
    print('K-Means : Predicted correctly.')
else:
    print('K-Means : Predicted incorrectly. Predicted ', kmeansPred)
KNN : Predicted correctly.
Decision Tree : Predicted correctly.
Random Forest : Predicted correctly.
Logistic Regression : Predicted incorrectly. Predicted  5
K-Means : Predicted correctly.

Dimensionality Reduction

27.
Principal Component Analysis (PCA) is a dimensionality-reduction method used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information of the original set.
Reducing the number of variables naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and they make analysis easier and faster for machine learning algorithms.

PCA has 3 steps:

  • 1) Standardization
    Standardizing the range of the variables makes each variable contribute equally to the analysis. This is critical because if there are large differences between the ranges of the initial variables, variables with larger ranges will dominate those with smaller ranges, which leads to biased results.
  • 2) Covariance matrix computation
    The aim of this step is to see whether there is any relationship between variables, because variables are sometimes so highly correlated that they contain redundant information.
    If the covariance of two variables is positive, they increase or decrease together (positively correlated); if it is negative, one increases as the other decreases (inversely correlated).
  • 3) Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
    Eigenvectors and eigenvalues are linear algebra concepts; they are computed from the covariance matrix to determine the principal components of the data.
    Principal components are new variables constructed as linear combinations of the initial variables. The combinations are chosen so that the new variables are uncorrelated and most of the information in the initial variables is compressed into the first components.
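The three steps above can be sketched directly in NumPy (on random toy data, not the MNIST features; the scikit-learn PCA used below wraps an equivalent computation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # toy data: 100 samples, 5 variables

# 1) standardize each variable
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
# 2) covariance matrix of the standardized data
C = np.cov(Xs, rowvar=False)
# 3) eigendecomposition; keep eigenvectors with the largest eigenvalues
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]               # top-2 principal components
X_reduced = Xs @ components                      # project down to 2 dimensions
print(X_reduced.shape)  # → (100, 2)
```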

28.

In [161]:
from sklearn.decomposition import PCA

pca = PCA(n_components=20, random_state = 0)
decomposedTrainData = pca.fit_transform(trainData)
decomposedTestData = pca.transform(testData)

decNeigh = KNeighborsClassifier(n_neighbors = 1)
decNeigh.fit(decomposedTrainData, trainLabels.values.ravel() )

kmeansClf = KMeans(n_clusters=10, random_state=0)
clusters = kmeansClf.fit(decomposedTrainData)

centers = kmeansClf.cluster_centers_

labels = []
for c in centers:
    label = decNeigh.predict([c])[0]
    labels.append(label)
    
pred = kmeansClf.predict(decomposedTestData)
   
predictedLabels = map(toLabel, pred) 
print("Accuracy: ", calcAccuracy(predictedLabels, testLabels)) 
Accuracy:  0.575

Accuracy doesn't change significantly, because PCA doesn't add any new features; it only reduces the dimensions. As you can see, we achieve accuracy close to our previous accuracy with only 20 features per example, which simplifies the analysis for K-Means and makes it faster.

CIFAR-10

In [337]:
trainDataCIFAR = pd.read_csv('./dataset/CIFAR10/CIFAR10_train_data.csv', header = None)
trainLabelsCIFAR = pd.read_csv('./dataset/CIFAR10/CIFAR10_train_label.csv', header = None, names = ['label'])
valDataCIFAR = pd.read_csv('./dataset/CIFAR10/CIFAR10_val_data.csv', header = None)
valLabelsCIFAR = pd.read_csv('./dataset/CIFAR10/CIFAR10_val_label.csv', header = None, names = ['label'])
testDataCIFAR = pd.read_csv('./dataset/CIFAR10/CIFAR10_test_data.csv', header = None)
In [322]:
randomForestClf = RandomForestClassifier(n_estimators=300, max_depth=12, random_state=0, max_features = 8
                                   , min_samples_split = 3)
randomForestClf.fit(trainDataCIFAR.values, trainLabelsCIFAR.values.ravel())
Out[322]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=12, max_features=8, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=3,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
In [323]:
n1 = 1000
n2 = 2000

print("Accuracy on TrainData : ", randomForestClf.score(trainDataCIFAR.iloc[0:n2].values, 
                                                        trainLabelsCIFAR.iloc[0:n2].values.ravel()))
print("Accuracy on TestData :",randomForestClf.score(valDataCIFAR.iloc[0:n1].values, 
                                                     valLabelsCIFAR.iloc[0:n1].values.ravel()))
Accuracy on TrainData :  0.799
Accuracy on TestData : 0.445

Turn the images to grayscale

For many image-processing applications, color information does not help us identify important edges or other features. If we do not need color and it is not important in our dataset (for instance, if the hues of the different classes do not vary remarkably), we can treat it as noise and convert the images to grayscale.

In [324]:
from skimage.color import rgb2gray

grayscaleTrainData = []
for i in range(49000):
    image = np.array(trainDataCIFAR.iloc[i], dtype='uint8').reshape(32, 32, 3)
    gray = rgb2gray(image).reshape(1024,)
    grayscaleTrainData.append(gray)
In [326]:
grayRandomForestClf = RandomForestClassifier(n_estimators=250, max_depth=11, random_state=0, max_features = 8
                                   , min_samples_split = 3)
grayRandomForestClf.fit(grayscaleTrainData, trainLabelsCIFAR.values.ravel())
Out[326]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=11, max_features=8, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=3,
                       min_weight_fraction_leaf=0.0, n_estimators=250,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
In [327]:
grayscaleValData = []
for i in range(1000):
    image = np.array(valDataCIFAR.iloc[i], dtype='uint8').reshape(32, 32, 3)
    gray = rgb2gray(image).reshape(1024,)
    grayscaleValData.append(gray)
In [328]:
n1 = 1000
n2 = 2000
print("Accuracy on TrainData : ", grayRandomForestClf.score(grayscaleTrainData[0:n2], 
                                                        trainLabelsCIFAR.iloc[0:n2].values.ravel()))
print("Accuracy on TestData :",grayRandomForestClf.score(grayscaleValData[0:n1], 
                                                     valLabelsCIFAR.iloc[0:n1].values.ravel()))
Accuracy on TrainData :  0.682
Accuracy on TestData : 0.38

If there is an edge in color that is hard to detect in a grayscale image, or if we need to identify objects of a known shade, then color information can be useful.
In this dataset color plays a key role in identifying the animals, so converting the images to grayscale drops the classification accuracy.

PCA

As described in the MNIST section, PCA reduces the dimensionality of a large data set by transforming a large set of variables into a smaller one that still contains most of the information, trading a little accuracy for simplicity and speed.

In [329]:
pca = PCA(n_components=70, random_state = 0)
decomposedTrainDataCIFAR = pca.fit_transform(trainDataCIFAR)
decomposedValDataCIFAR = pca.transform(valDataCIFAR)

pcaRandomForestClf = RandomForestClassifier(n_estimators=250, max_depth=11, random_state=0, max_features = 8
                                   , min_samples_split = 3)
pcaRandomForestClf.fit(decomposedTrainDataCIFAR, trainLabelsCIFAR.values.ravel())

pred = pcaRandomForestClf.predict(decomposedValDataCIFAR)   

print("Accuracy: ", calcAccuracy(pred, valLabelsCIFAR)) 
Accuracy:  0.448

Classification accuracy with PCA is 0.003 higher than with the random forest alone.

Random Projection

Random projection reduces the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes.
Gaussian random projection reduces the dimensionality by projecting the original input space onto a randomly generated matrix whose components are drawn from the distribution N(0, 1 / n_components).
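The projection itself is a single matrix product, sketched below with toy sizes (not the CIFAR data) and Gaussian entries whose variance is 1 / n_components:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_components = 784, 50

X = rng.normal(size=(10, n_features))            # 10 toy samples
# random projection matrix, entries ~ N(0, 1 / n_components)
R = rng.normal(scale=1.0 / np.sqrt(n_components),
               size=(n_features, n_components))
X_proj = X @ R                                    # project to 50 dimensions
print(X_proj.shape)  # → (10, 50)
```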

In [334]:
from sklearn import random_projection

transformer = random_projection.GaussianRandomProjection(random_state = 0, eps = 0.5)
transTrainDataCIFAR = transformer.fit_transform(trainDataCIFAR)
transValDataCIFAR = transformer.transform(valDataCIFAR)

transRandomForestClf = RandomForestClassifier(n_estimators=250, max_depth=11, random_state=0, max_features = 8
                                   , min_samples_split = 3)
transRandomForestClf.fit(transTrainDataCIFAR, trainLabelsCIFAR.values.ravel())

pred = transRandomForestClf.predict(transValDataCIFAR)   

print("Accuracy: ", calcAccuracy(pred, valLabelsCIFAR)) 
Accuracy:  0.406

Random projection did not increase the accuracy.

In [339]:
decomposedTestDataCIFAR = pca.transform(testDataCIFAR)

pred = pcaRandomForestClf.predict(decomposedTestDataCIFAR) 

df = pd.DataFrame({'id' : range(1, 10001), 'predicted': pred})
df.to_csv(path_or_buf = './810195419.csv', index=False)